智能论文笔记

Location reference recognition from texts: A survey and comparison

Xuke Hu , Zhiyong Zhou , Hao Li , Yingjie Hu , Fuqiang Gu , Jens Kersten , Hongchao Fan , Friederike Klan

分类：自然语言处理

2022-07-04

非结构化的文本中存在大量的位置信息，例如社交媒体帖子，新闻报道，科学文章，网页，旅行博客和历史档案。地理学是指识别文本中的位置参考并识别其地理空间表示的过程。虽然地理标准可以使许多领域受益，但仍缺少特定应用程序的摘要。此外，缺乏对位置参考识别方法的现有方法的全面审查和比较，这是地理验证的第一个和核心步骤。为了填补这些研究空白，这篇综述首先总结了七个典型的地理应用程序域：地理信息检索，灾难管理，疾病监视，交通管理，空间人文，旅游管理和犯罪管理。然后，我们通过将这些方法分类为四个组，以基于规则的基于规则，基于统计学学习的基于统计学学习和混合方法将这些方法分类为四个组，从而回顾了现有的方法参考识别方法。接下来，我们彻底评估了27种最广泛使用的方法的正确性和计算效率，该方法基于26个公共数据集，其中包含不同类型的文本（例如，社交媒体帖子和新闻报道），包含39,736个位置参考。这项彻底评估的结果可以帮助未来的方法论发展以获取位置参考识别，并可以根据应用需求指导选择适当方法的选择。

translated by 谷歌翻译

Surrogate-based cross-correlation for particle image velocimetry

Yong Lee , Fuqiang Gu , Zeyu Gong

分类：计算机视觉

2021-12-10

本文介绍了一种基于代理的互相关（SBCC）框架，以提高两个图像信号之间的相关性能。 SBCC背后的基本思想是，提供一个原始图像的优化代理滤波器/图像将产生更强大的且更准确的相关信号。 SBCC的互相关估计与由替代丢失和相关稠度损失组成的目标函数。闭合溶液提供有效的估计。为了我们的意外，SBCC框架可以提供替代视图来解释一组广义互相关（GCC）方法并理解参数的含义。在我们的SBCC框架的帮助下，我们进一步提出了四种新的特定互联方法，并提供了一些提高现有GCC方法的建议。值得注意的事实是，通过纳入其他否定的上下文图像，SBCC可以增强相关性鲁棒性。考虑到粒子图像VELOCIMETRY（PIV）的子像素精度和鲁棒性要求，用粒子图像研究了目标函数中的每个术语的贡献。与最先进的基线方法相比，SBCC方法在合成数据集中表现出改善的性能（准确性和鲁棒性）和一些具有挑战性的真实实验PIV病例。

translated by 谷歌翻译

A Comparative Study of Image Disguising Methods for Confidential Outsourced Learning

Sagar Sharma , Yuechun Gu , Keke Chen

分类：机器学习

2022-12-31

Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.

translated by 谷歌翻译

Translating Text Synopses to Video Storyboards

Xu Gu , Yuchong Sun , Feiyue Ni , Shizhe Chen , Ruihua Song , Boyuan Li , Xiang Cao

分类：计算机视觉

2022-12-31

A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.

translated by 谷歌翻译

Pontryagin Optimal Controller via Neural Networks

Chengyang Gu , Yize Chen

分类：机器学习

2022-12-30

Solving real-world optimal control problems are challenging tasks, as the system dynamics can be highly non-linear or including nonconvex objectives and constraints, while in some cases the dynamics are unknown, making it hard to numerically solve the optimal control actions. To deal with such modeling and computation challenges, in this paper, we integrate Neural Networks with the Pontryagin's Minimum Principle (PMP), and propose a computationally efficient framework NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It can not only utilize the accurate surrogate models parameterized by neural networks, but also efficiently recover the optimality conditions along with the optimal action sequences via PMP conditions. A toy example on a nonlinear Martian Base operation along with a real-world lossy energy storage arbitrage example demonstrates our proposed NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by the numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.

translated by 谷歌翻译

NeMo: 3D Neural Motion Fields from Multiple Video Instances of the Same Action

Kuan-Chieh Wang , Zhenzhen Weng , Maria Xenochristou , Joao Pedro Araujo , Jeffrey Gu , C. Karen Liu , Serena Yeung

分类：计算机视觉

2022-12-28

The task of reconstructing 3D human motion has wideranging applications. The gold standard Motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap as they take single-view videos as inputs. Replacing the multi-view Mo- Cap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion thus making exciting applications like motion analysis and motiondriven animation accessible to the general public. However, performance of existing HMR methods degrade when the video contains challenging and dynamic motion that is not in existing MoCap datasets used for training. This reduces its appeal as dynamic motion is frequently the target in 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action,and show that NeMo achieves better 3D reconstruction compared to various baselines.

translated by 谷歌翻译

Do DALL-E and Flamingo Understand Each Other?

Hang Li , Jindong Gu , Rajat Koner , Sahand Sharifzadeh , Volker Tresp

分类：计算机视觉 | 机器学习

2022-12-23

A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.

translated by 谷歌翻译

GAN-based Domain Inference Attack

Yuechun Gu , Keke Chen

分类：机器学习 | 人工智能

2022-12-22

Model-based attacks can infer training data information from deep neural network models. These attacks heavily depend on the attacker's knowledge of the application domain, e.g., using it to determine the auxiliary data for model-inversion attacks. However, attackers may not know what the model is used for in practice. We propose a generative adversarial network (GAN) based method to explore likely or similar domains of a target model -- the model domain inference (MDI) attack. For a given target (classification) model, we assume that the attacker knows nothing but the input and output formats and can use the model to derive the prediction for any input in the desired form. Our basic idea is to use the target model to affect a GAN training process for a candidate domain's dataset that is easy to obtain. We find that the target model may distract the training procedure less if the domain is more similar to the target domain. We then measure the distraction level with the distance between GAN-generated datasets, which can be used to rank candidate domains for the target model. Our experiments show that the auxiliary dataset from an MDI top-ranked domain can effectively boost the result of model-inversion attacks.

translated by 谷歌翻译

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu , Yixiao Ge , Xintao Wang , Weixian Lei , Yuchao Gu , Wynne Hsu , Ying Shan , Xiaohu Qie , Mike Zheng Shou

分类：计算机视觉

2022-12-22

To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video dataset for fine-tuning. However, such paradigm is computationally expensive. Humans have the amazing ability to learn new visual concepts from just one single exemplar. We hereby study a new T2V generation problem$\unicode{x2014}$One-Shot Video Generation, where only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models are able to generate images that align well with the verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via an efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing temporally-coherent videos over various applications such as change of subject or background, attribute editing, style transfer, demonstrating the versatility and effectiveness of our method.

translated by 谷歌翻译

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Vivek Rathod , Bryan Seybold , Sudheendra Vijayanarasimhan , Austin Myers , Xiuye Gu , Vighnesh Birodkar , David A. Ross

分类：计算机视觉

2022-12-20

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable openvocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.

translated by 谷歌翻译